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Abstract. Accurate reconstruction of phylogenies remains a key chal- 
lenge in evolutionary biology. Most biologically plausible formulations 
of the problem are formally NP-hard, with no known efficient solution. 
The standard in practice are fast heuristic methods that are empirically 
known to work very well in general, but can yield results arbitrarily far 
from optimal. Practical exact methods, which yield exponential worst- 
case running times but generally much better times in practice, provide 
an important alternative. We report progress in this direction by intro- 
ducing a provably optimal method for the weighted multi-state maximum 
parsimony phylogeny problem. The method is based on generalizing the 
notion of the Buneman graph, a construction key to efficient exact meth- 
ods for binary sequences, so as to apply to sequences with arbitrary finite 
numbers of states with arbitrary state transition weights. We implement 
an integer linear programming (ILP) method for the multi-state prob- 
lem using this generalized Buneman graph and demonstrate that the 
resulting method is able to solve data sets that are intractable by prior 
exact methods in run times comparable with popular heuristics. Our 
work provides the first method for provably optimal maximum parsi- 
mony phylogeny inference that is practical for multi-state data sets of 
more than a few characters. 

Introduction 

One of the fundamental problems in computational biology is that of infer- 
ring evolutionary relationships between a set of observed amino acid sequences 
or taxa. These evolutionary relationships are commonly represented by a tree 
(phylogeny) describing the descent of all observed taxa from a common ancestor, 
a reasonable model provided we are working with sequences over small enough 
regions or distant enough relationships that we can neglect recombination or 
other sources of reticulation [1]. Several criteria have been implemented in the 
literature for inferring phylogenies, of which one of the most popular is maximum 



parsimony (MP). Maximum parsimony defines the tree(s) with the fewest muta- 
tions as the optimum, generally a reasonable assumption for short time-scales or 
conserved sequences. It is a simple, non-parametric criterion, as opposed to com- 
mon maximum likelihood models or various popular distance- based methods [2] . 
Nonetheless, MP is known to be NP-hard [3] and practical implementations of 
MP are therefore generally based on heuristics which do not guarantee optimal 
solutions. 

For sequences where each site or character is expressed over a set of discrete 
states, MP is equivalent to finding a minimum Steiner tree displaying the input 
taxa. For example, general DNA sequences can be expressed as strings of four 
nucleotide states and proteins as strings of 20 amino acid states. Recently, Srid- 
har et al. [4] used integer linear programming to efficiently find global optima for 
the special case of sequences with binary characters, which are important when 
analyzing single nucleotide polymorphism (SNP) data. The solution was made 
tractable in practice in large part by a pruning scheme proposed by Buneman 
and extended by others [5-7] . The so-called Buneman graph B for a given set of 
observed strings is an induced sub-graph of the complete graph Q (whose nodes 
represent all possible strings of mutations) such that B C Q still contains all 
distinct minimum Steiner trees for the observed data. By finding the Buneman 
graph, one can often greatly restrict the space of possible solutions to the Steiner 
tree problem. While there have been prior generalizations of the Buneman graph 
to non-binary characters [8,9], they do not provide any comparable guarantees 
usable for accelerating Steiner tree inference. 

In this paper, we provide a new generalization of the definition of Buneman 
graph for any finite number of states that guarantees the resulting graph will 
contain all distinct minimum Steiner trees of the multi-state input set. Further, 
we allow transitions between different states to have independent weights. We 
then utilize the integer linear programming techniques developed in [4] to find 
provably optimal solutions to the multi-state MP phylogeny problem. We vali- 
date our method on four specific data sets chosen to exhibit different levels of 
difficulty: a set of nucleotide sequences from Oryza rufipogon [10], a set of hu- 
man mt-DNA sequences representing prehistoric settlements in Australia [11], a 
set of HIV-1 reverse transcriptase amino acid sequences and, finally, a 500 taxa 
human mitochondrial DNA data set. We further compare the performance of 
our method, in terms of both accuracy and efficiency, with leading heuristics, 
PAUP* [12] and the pars program of PHYLIP [13], showing our method to yield 
comparable and often far superior run times on non-trivial data sets. 

Methods 

Notation & Background 

Let H be an input matrix that specifies a set of N taxa \i over a set of m 
characters C = {ci, . . . c m } such that Hij represents the j th character of the i th 
taxon. The taxa of H represent the terminal nodes of the Steiner tree inference. 
Further, let nk be the number of admissible states of the k th character Cfc. The set 



of all possible states is the space S = {0, 1, . . . n\ — 1} • • • <S>{0, 1, . . . n m — 1}. 
We will represent the i th character of any element b G S, by (6)j. The state 
space S can be represented as a graph Q = (Vg,Eg) with the vertex set Vg = S 
and edge set Eg = {(u,v)\u,v G <S, Y^T p eC ( v )p\ = 1}: w bere 5[o, 6] = if 

a = b and 1 otherwise. Furthermore, let a — {a p \c p G C} be a set of weights, 
such that a p [i, j] represents an edge length for a transition between states i,j G 
{0, . . .rip — 1} for character c p . We will assume that these lengths are positive 
(states that share zero edge length are indistinguishable), symmetric in i,j and 
satisfy the triangle inequality. 

otp[i,j] +a p \j,k] > a p [i,k] V i, j, k G {0, . . . n p - 1} (1) 

Non-negativity and symmetry are basic properties for any reasonable definition 
of length. If a particular triplet of states (say k) does not satisfy the triangle 
inequality in equation 1, we can set a p [i,k] = a p [i,j] + a p [j 7 k] and still ensure 
that the shortest path connecting any set of states remains the same. We can 
now define a distance d a over Q, such that for any two elements u, v G Vg 

m 

d a [u, v] = ^2 a p [(u) p , (v)p\ (2) 

pGC 

Given any subgraph K = (Vk,Ek) of Q, we can define the length of K 
to be the sum of the lengths of all the edges L(K) = J2(u v)eE K d a [u,v]. The 
maximum parsimony phylogeny problem for \ i s equivalent to constructing the 
minimum Steiner tree T* displaying the set of all specified taxa \, i.e., any tree 
T*(14,_E*) such that x C K and L(T*) is minimum. Note that T* need not be 
unique. 

Pre-processing 

Before we construct the generalized Buneman graph corresponding to an input, 
we perform a basic pre-processing of the data. The set of taxa in the input H 
might not all be distinct over the length of sequence represented in H . These 
correspond to identical rows in H and are eliminated. Similarly, characters that 
do not mutate for any taxa do not affect the true phylogeny and can be removed. 
Furthermore, if two characters are expressed identically in \ (modulo a relabeling 
of the states), we will represent them by a single character with each edge length 
replaced by the sum of the edge lengths of the individual characters. In case there 
are n such non-distinct characters, one of them is given edge lengths equal to 
the sum of the corresponding edges in each of the n characters and the rest 
are discarded. These basic pre-processing steps are often useful in considerably 
reducing the size of input. 

Buneman graph 

The Buneman graph was introduced as a pruning of the complete graph for 
the special case of binary valued characters. For this special case it is useful 



to introduce the notion of binary splits c p (0)|c p (l) for each character c p £ C, 
which partition the set of taxa \ two sets c p (0) and c p (l) corresponding 
to the value expressed by c p . Each of these sets is called a block of c p . Each 
vertex of the Buneman graph B can be represented by an m-tuple of blocks 
[ci(ii), 02(12), • • • , c m (i m )}, where ij = or 1, for j £ {1,2,... m}. To construct 
the Buneman graph, a rule is defined for discarding/retaining the subset of 
vertices contained in each pair of overlapping blocks [c p (i p ), c q (i q )] for each pair 
of characters (c p ,c q ) £ CxC. All vertices which satisfy c p (i p )ric q (i q ) — for any 
pair of characters (c p , c q ) can be eliminated, while those for which c p {i p )C\c q {i q ) ^ 
for all [cp(ip), c q (i q )\ are retained. Buneman previously established for the 
binary case that the retained vertex set will contain all terminal and Steiner 
nodes of all distinct minimum length Steiner trees. 

We extend this prior result to the weighted multi-state case by presenting an 
algorithm analogous to the binary case to construct a graph with these proper- 
ties. 

Algorithm for constructing the generalized Buneman graph 

Briefly, the algorithm looks at the input matrix projected onto each distinct 
pair of characters p, q and constructs a n p x n q matrix C(p, q), where the i x j 
element C(p, q)ij is 1 only if there is at least one taxon t such that (t) p — i and 
(t) q = j and zero otherwise. The algorithm then implements a rule for each such 
pair of characters p, q that allows us to enumerate the possible states of those 
characters in any optimal Steiner tree. For clarity, we will assume that each state 
for each character is expressed in at least one input taxon, since states that are 
not present in any taxa cannot be present in a minimum length tree because of 
the triangle inequality. The rule is defined by a n p x n q matrix R(p, q) determined 
by the following algorithm : 

1. R(p,q) i:j <r- C(p,q)ij for all i £ {0, 1, . . . n p - 1} and j £ {0, l,...n g — 1}. 

2. If all non-zero entries in C(p,q) are contained in the set of elements 

(U k C(p,q) lk )\J(U k C(p, q) ki ) 

for a unique pair i £ {0, 1, . . . n p — 1} and j £ {0, 1, . . . n q — 1} then R(p, q) xy <— 
1 for all x, y such that either x = i or y = j (See Fig 1 where this pair of 
states are denoted i pq and i qp .) 

3. If the condition in step 2 is not satisfied then set R(p,q)ij ^— 1 for all 

This set of rules {R} then defines a subgraph B pq C Q for each pair of characters 
p, q, such that any vertex v £ B pq if and only if R(p, q)^ / v \ = 1. The intersec- 
tion of these subgraphs B = C\ Cp ,c q ecB pq then gives the generalized Buneman 
graph for \ given any set of distance metrics a = {a p \c p £ C}. Note that the 
Buneman graph of any subset of x is a subset of B. It is easily verified that for 
binary characters, our algorithm yields the standard Buneman graph. 

The remainder of this paper will make two contributions. First, it will show 
that the generalized Buneman graph B defined above contains all minimum 
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Fig. 1. An example of the generalized Buneman pruning condition. If all taxa in \ are 
present in the shaded region, vertices in all other blocks can be discarded. 



Stciner trees for the input taxa x- This will in turn establish that restricting the 
search space for minimum Stciner trees to B will not affect the correctness of the 
search. The paper will then empirically demonstrate the value of these methods 
to efficiently finding minimum Steiner trees in practice. 

Before we prove that all Steiner minimum trees connecting the taxa are 
displayed in B, we need to introduce the notion of a neighborhood decomposition. 
Suppose we are given any tree T(V,E) displaying the set of taxa x- We will 
contract each degree- two Steiner node (i.e., any node that is not present in x) an d 
replace its two incident edges by a single weighted edge. Such trees are called X- 
Trees [14]. Each X-Tree can be uniquely decomposed into its phylogenetic X-Tree 
components, which are maximal subtrees whose leaves are taxa. Formally, each 
phylogenetic X-Tree P(i/j) consists of a set of taxa ip C x and a tree displaying 
them, such that there is a bijection or labeling 77 : lp — J> ip between elements 
of ip and the set of leaves lp G P{ip) [14] (Fig 2) . All vertices in P(ip) with 
degree 3 or higher will be called branch points. From now on we will assume that 
given any input tree, such a decomposition has already been performed (Fig 2). 
Two phylogenetic X- Trees P(ip) and P'{ip) are considered equivalent if they have 
identical length and the same tree topology. By identical tree topology, we mean 
there is a bijection between the edge set of the two trees, such that removing any 
edge and its image partitions the leaves into identical bi-partitions. We define 
two trees to be neighborhood distinct if after neighborhood decomposition they 
differ in at least one phylogenetic X-Tree component. We define a labeling of 
the phylogenetic X-Tree as an injective map r : P — s> Q between the vertices 
of P(tp) and those of the graph Q such that P u represents the character string 
for the image of vertex u in Q. Since leaf labels are fixed to be the character 
strings representing the corresponding taxa, Pt = rjt G ip for any leaf t £ lp. 
Identical phylogenetic X- Trees can, however, differ in the labels r u of internal 
branch points u G P\lp. 

We will use a generalization of the Fitch-Hartigan algorithm to weighted 
parsimony proposed by Erdos and Szekely [15,16]. The algorithm uses a simi- 




Fig. 2. An input tree and its phylogenetic X-Tree components,with taxa labelled by 
integers. 



lar forward pass/backward pass technique to compute an optimal labeling for 
any phylogenetic X-Tree T(ijj). Arbitrarily root the tree T(ip) at some taxon £ 
and starting with the leaves compute the minimum length minL(rb,Tb) of any 
labeling of the subtree Tf, consisting of the vertex b and its descendants, where 
the root b is labeled as follows. 

1. If It labels a leaf 775 G ip, minLiTb = 77^ , ) = and 00 otherwise. 

2. If b has k children Db = {v\, . . .Vk}, and T v is the subtree consisting of 
v G Db and its descendants, 

minL(r b ,T b ) = mm{minL(r v , T v ) + d a [r b , r v } } (3) 

where the minimum is to be taken over all possible labels r v for each char- 
acter and for each child v G D b . 

The optimal labeling of T(ip) is one which minimizes the length at the root: 
L(T) — minL(r]^,T^). Labels for each descendant are inferred in a backward 
pass from the root to the leaves and using equation 3. Note that the minimum 
length of a tree is just the sum of minimum lengths for each character, i.e., 
minL{r b ,T b ) = ^ c eC minL(rb,Tb)^ s \ where mm£(il,,T(,)W is the minimum 
cost of tree Tf, rooted at b for character c s . 

Briefly, our proof is structured as follows: Given any phylogenetic X-Tree 
T(ip) labeling (typically denoted T below), we will show that the generalized 
Buneman pruning algorithm for each pair of characters (c p , c q ) defines a sub- 
graph B pq which contains at least one possible labeling of no higher cost (typ- 
ically denoted <P below) for T(ip). We will then show that the intersection of 
these subgraphs B = n p ^ q B pq thus contains an optimal labeling for T(ip). 

If the pruning condition in step 2 of the algorithm that defines the Buneman 
graph is not implemented for the pair of characters (c pi c q ), then B pq — Q and 
all labels are necessarily inside B pq . We prove the following lemma for the case 
when the pruning condition is satisfied, ie., there exist unique states i pq of c p 
and i qp of c q , such that each element in the set of leaves It = {t G T(ip)\r)t G ip} 
either has (rjt) p = i pq or (r/ t ) q = i qp or both. Each time we relabel vertices, we 
will keep all characters except c p and c q fixed. To economize our notation, we will 
represent the sum of costs in c p and c q of the tree T labeled by T, which has some 




Fig. 3. (a) The base case of a degree \tj)\ star that can be attached to a parent vertex 
( in the Erdos-Szekely algorithm, (b) T(ip) for the general case (see Lemma 1) 



branch point b as the root, simply by writing L(r, T) = L(T, T)W + L(T, T)^). 
We use the notation F x = [(r x ) p , {r x ) q \ to represent the label for a vertex x and 
suppress the state of all other characters. 

Lemma 1. Given any phylogenetic X-Tree T(ip) with ip C B pq , and a labeling 
r, such that an internal branch point b G T \ It is labeled outside B pq , i.e., 
I], ^ B pq , there exists an alternate labeling <P ofT{ip) inside B pq such that 

1. either L(r,T) > L(<P,T) + d a [r b ,<P b ], or - 

2. L(r,T) > L(<P,T) for each of the following choices: <P b — [i pq ,i qp ] or [i pq , (r b ) q ] 
or {(r b ) p ,i qp \, and <P V — r v for all v ^ b. We will call a tree that satisfies 
this second condition a (c p ,c q )-Tree 

Proof. We will use induction on the number of internal branch points outside 
B pq to prove the claim. Without loss of generality we can consider all branch 
points of T(t/j) to be labeled outside B pq . If some branch points are labeled 
inside B pq then they can be treated as leaves of smaller X-Tree(s) that have all 
branch points outside B pq . This is similar to the neighborhood decomposition 
we performed earlier for those branch points that were present in the set of input 
taxa. The set of branch points is then the set T \ It = {u 6 T\T U ^ B pq }. 

For the base case assume all the leaves are joined at a single branch point b 
to form a star of degree \tp\ (see Fig. 3(a) without the root Q. We can group the 
leaves into three sets: 

1- I = {r)u = [ipq,Uu]\yu # i qp ,riu £ ^1 

2. II = {rj v = [x v ,i qp ]\x v ^ i pq ,r) v G V} 

3. Ill = {i] w = [i pq ,iqp]\r) w G V} 

The cost of the tree for c p and c g , with branch point r b = [x,y], is 

L(r, t) w + L(r, t) («) = Y, ( a P [x, i Pq ] + aq[y,yu}) + Y ( a p t x ' x «] 

uei veil 

+ a q [y,iqp\)+ Y (a P [x,ipq] +a q [y,i qp \) (4) 
wein 

The only way for L(r,T)W + L(r b ,T)^ to be minimum with x # i pq and 
y i^iqp, is if III = and |/| = \II\. For contradiction, suppose |/| + \III\ > \II\. 



We could then define a labeling <P identical to T over all characters, except 
@b = [ipqiV]i such that d a [r b , $b] = ce p [r b , $b]- We could then reduce the length, 
since 

L(r,T) {p) = ^2a p [x,i pq ] + ^ a p [x,x v ] + a p [x,i pq ] 
uei veil wein 

^ Clp[x, i p q\ ~\~ ^ y , iptp Xy] ~\~ OL p [x, ^pg]) 
veil 

> a p [x,i pq ] + apiipq^v] = L($,T)M +d a [r b ,$ b ] (5) 
veil 

where the last inequality follows from the triangle inequality. Similarly, if \II\ + 
\III\ > |/|, we could define <Z> 6 = [x,i qp ] and arrive at L(r,T)M > L($,T)^ + 
d a [r b ,<P b }. 

On the other hand if |/| = \II\ and 777 = setting $ b = [i pq ,y] or $ b = 
[x,i qp ] or <P b = [i pq ,i qp ] all achieve a length no more than L(r,T)^ + L(r,T)( q \ 
Therefore, this is a (c p , c g )-Tree. This proves the base case for our proposition. 

We will now assume that the claim is true for all trees with n branch points 
or less. Suppose we have a labeled tree T(ip) with n + 1 branch points which 
are all outside B pq . Let D b = {v\, . . . Vk} be the children of a branch point 6 
in T(tp) and {Ti, . . .Tk} be the subtrees of each v e D b and their descendants. 
Note that some of these descendants may be leaves. Since T(ip) has at least two 
branch points, one of its descendants (say v\) must be a branch point (Fig 3(b)). 
Let T b = T \ Ti be the subtree consisting of b and all its other descendants. For 
clarity we will use the notation r b — [x b ,y b ] and r vi — [x\,yi]. This implies, 

L(r, t) = L(r, n) + L(r, t x ) + d a \r b , r vi ] 

= L{r, T b ) + L(r, TO + a p [x b , Xl ] + a q [y bl y{\ (6) 
There are four possibilities. 

1. Both T b and T\ are (c p , c 9 )-Trees with n or less branch points - In this case, 
by induction, both T b and T\ can be relabeled with <P b and (l> Vl of the form 
[ipqiiqp]- Since the cost in c p and c q of the edge (b,v\) is now zero, we have 
an optimal labeling of T(tp) within B pq and L(r, T) > L(<P,T) . Note that 
each of the choices of the form [i pq ,yi] or for relabeling of 6 also 
satisfy property 2 of the claim. Therefore, this is a (c p , c q )-Tree. 

2. T b is a (c p , c 9 )-Tree, but Ti is not. Therefore, there is a labeling <P of T\ with 
either $ Vl = [i pq ,yi] and/or <P 

vi — [^li^pq] such, that 
L(r, Ti) > L(<P, Ti) + d a [r vi , <P V1 ] (7) 

Let us assume for concreteness that <P Vl — [i pqi y{\. It will become clear that 
the argument works for the other possible choices. Since, T b is a (c p , c g )-Tree, 
by induction, we can choose a labeling of T b with <P b — [i pq , y b ] , such that 
L(r,T b ) > L(4>,T b ). This gives 

L($, T) = L{$, T b ) + L($, Ti) + d a [<P 6) 

= L($,T b )+L($,T 1 )+a q [y b ,y 1 ] (8) 



Comparing the previous two equations with equation 6, we get, 

L(r,T) = L(T,T b ) + L{r,Ti) + a p [x bl xi] + a q [y b ,y{\ 

> L(4>,T b ) + L(^,Ti) + d a [r vi ,$ Vl ] +a p [x b ,x 1 ] +a q [y b , yi ] 
= L(<P,T b ) +L(#,Ti) +a p [x 1 ,i pq ] + a p [x b , x x ] + a q [y b ,yi] 

Up [x b , ipq ] + a q [y b ,yi] 
= L{$,T b ) + L{$,T 1 ) + d a [T b , b ] + d a [$ b , $ V1 ] 
= L($,T) + d a [r b ,$ b ] (9) 

which satisfies the first possibility of the claim. It should be clear that if 
&V! — [xi,iqp] then the choice <P b = [x b ,i qp ] would give an identical bound. 

3. T\ is a (cp, c g )-Tree, but T b is not. This case is similar to the previous one. 
Since T b has less than n branch points, which are all outside B pq , and it is 
not a (c p , c g )-Tree, we have from induction a labeling <P of T b with either 
$b = [ipq, Vb] and/or $ b = [x b , i pq ] such that 

L(r,T b ) > L($,T b ) + d a [r b ,$ b ] (10) 

As before, let us assume <P b = [i pq , y b ] for concreteness. Since T\ is a (c p , c q )- 
Tree, we can choose a labeling with 4> Vl = [i pq ,yi] such that L(r,T\) > 
L(#,7i). This gives, 

L(<2>, T) = L(#, T b ) + L(#, TO + d a [$ b , <2>„J 

= L(#,T 6 )+L(#,T 1 )+a,| J , t) yi] (11) 

Comparing the previous two equations with equation 6, we get, 

L(T,T) = L(T,T b ) +L(r,Ti) +a p [x b ,x 1 ] +a q [y b ,y 1 ] 

> L{$,T b ) + L{$,T{) + d a [r b ,$ b ] +a p [x b ,x 1 ]+a q [y b ,y 1 ] 

> L{$, T b ) + L{$, Ti ) + d a [T b , <Z> h ] + d a [# 6 , $ V1 } 

= L($,T) + d a [r b ,$ b ] (12) 

An identical argument carries through if <P b — [x b ,i qp }. 

4. Neither T\ or T b are (c p , c 9 )-Trees. It follows from induction that there is 
a labeling $ such that L(r,T b ) > L{<P,T b ) + d a [T b ,<P b ] and L{T,Ti) > 
L(<P,Ti) + d a [r Vl ,<l> Vl }. There are two possibilities in this case. 

(a) (<P b = [i P q,y b ] and <P Vl = [i pq ,yi]) or (<P b = [x b ,i qp ] and <P Vl = [xi,i qp ]). 
As before, we will prove the claim for the former possibility while the 
later case can be proved by an identical argument. 

L(*, T) = L{$, T b ) + L(<2>, Ti) + d a [$ b , $ V1 ] 

= L($,T b )+L(4>,T 1 ) + a q [y b ,y 1 ] (13) 



L{T, T) = L(T, T b ) + L(T, T^ + a p [x b , xi] + a q [y b , yi ] 



> L(#, T b ) + L{$, Ti) + rf« [A, <Z> 6 ] + d a [r wi , 
+ a p [xb,xi] + a q [yb,yi] 

> L{$, T b ) + L($, Ti) + d a [r b , <Z> fc ] + a g [y&, Vl ] 
= L($, T b ) + L(«P, Ti) + d a [r b , <P b ] + d a [<P b , <P V1 ] 

= L(<P,T)+d a [r b ,<P b ] (14) 

This also satisfies the claim. The proof for <P b = [x b , i qp ] and <P Vl = 
[xi,iqp] is identical, 
(b) (<P b = [i pq ,y b ] and <P Vl = [xx, i qp \) or (<P b = [x b ,i qp ] and $ Vl = [i pq ,yi]). 
As before, we show the calculation for the former possibility. In this case 

L(<P, T) = L(<P, T b ) + L(«P, TO + d a [0 b , <P V1 ] 

= L(#, T b ) + L(4>, T^ + a p [x b , i pq ] + a q [i qp , y{\ (15) 

Combining this with equation 6 we get, 

L(T,T) = L(r,T b ) + L{r,T{) + a p [x b ,xi] + a q [y b ,yi] 

> L{$, T b ) + L(«P, Ti) + d a [r b , $ b ] + d a [r vi , 
+ a p [a;{„xi] + a 9 [y 6 ,2/i] 

+ a p [x b ,xi] + a q [y b ,y\] 

> L(<P,T b ) + L(<P,Tx) + a p [x b ,i pq ] + a q [i qp ,y\] 

= L($,T b )+L{$,T 1 )+d a [$b,$v 1 ] =L($ b ,T) (16) 

But if we now relabel b and vi with <P Vl — [i pq ,i qp ] and <t> b = [i pq ,i qp ] 
while <P V = <P V for all other v, we get L(#,Ti) + a 9 [yi,ig P ] > L{i> Vl ,Ti) 
and L(<P,T b ) + a P [x b , i pq ] > L(<P,T b ). This immediately gives, 

L(#, T) = L(#, T 6 ) + L(#, Ti) + d a [$ b , K\ 

> L(<P, T) > L(r, T) (17) 

Identical arguments work for the choices $ Vl = [xi,i qp ] and<£>6 = [x b ,i qp \. 

This proves that if either of the two possibilities claimed are always true for an 
X-Tree with n branch points or less then they are also true for a tree with n + 1 
branch points. The proof for arbitrary n follows from induction. □ 

Corollary 1. Given a minimum length phylogenetic X-Tree T(ip) there is an 
optimal labeling for each branch point within B. 

Proof. Lemma 1 establishes that for any minimum Steiner tree labeled by T and 
any branch point b G T such that T b £ B pq , an alternative optimal labeling <P 
exists such that <P b is inside the union of blocks 



A(r b ,p,q) = [c P (i pq )c q (i qp )] U [c P (i pq )c q ((r b ) q )} U [c p ((r b ) p )c q (i qp )} 



If we root the tree at b, the new optimal labeling for all its descendants is 
inferred in a backward pass of the Erdos-Szekcly algorithm. This ensures that 
each branch point in a minimum length phylogenetic X-Tree is labeled inside 
B pq . Let Sb = <^B pg jtgA(rb,p,q) C B, where the intersection is taken over all 
pair of characters for which the pruning condition is satisfied. It follows from 
Lemma 1 that Sb also contains an alternate optimal labeling of T(^>). Note that 
Sb is a non-empty subset of B. This must be true because given a character pair 
c p ,c q , each union of blocks contains at least one taxon and so the rule matrix 
R(j), q) that defines the Buneman graph must have ones for each of these blocks. 
Therefore each element in Sb represents a distinct vertex of the Buneman graph. 



As argued before, any minimum Steiner tree can be decomposed uniquely into 
phylogenetic X-Tree components and the previous corollary ensures that each 
phylogenetic X-Tree can be labeled optimally inside the generalized Buneman 
graph. It follows that all distinct minimum Steiner trees are contained inside the 
generalized Buneman graph. 



Integer Linear Program (ILP) Construction 

We briefly summarize the ILP flow construction used to find the optimal phy- 
logeny. We convert the generalized Buneman graph into a directed graph by 
replacing an edge between vertices u and v with two directed edges (u, v), (v, u) 
each with weight w uv as determined by the distance metric. Each directed edge 
has a corresponding binary variable s u . v in our ILP. We arbitrarily choose one of 
the taxa as the root r, which acts as a source for the flow model. The remaining 
taxa T = x — {r} correspond to sinks. Next, we set up real- valued flow variables 
f l u v , representing the flow along the edge (u, v) that is intended for terminal t. 
The root r outputs \T\ units of flow, one for each terminal. The Steiner tree is 
the minimum-cost tree satisfying the flow constraints. This ILP was described 
in [4], and we refer the reader to that paper for further details. The ILP for this 
construction of the Steiner tree problem is the following: 



□ 




w. 



{u,v)£B 



subject to Y^ifiv ~ ft, J 







VueB\{t,r}, VteT 



V 




1 



Vt e T 



V 



< fl v < s UtV 
s u ,v e {0, 1} 



V(u, v)eB,yteT 
V(u,«) g B 



(18) 



Table 1. Pruning and run time results for the data sets reported. 



Data 


Input 

(raw) 


Complete 
graph 


\B\ 


ILP 

length time 


pars 

length time 


PAUP* 

length time 


0. rufipogon DNA 


41 x 1043 


2 LH * 3 A 


58 


57 0.29s 


57 2.57s 


57 2.09s 


Human mt-DNA 


80 x 245 




64 


44 0.48s 


45 0.56s 


44 5.69s 


HIV-1 RT protein 


50 x 176 


2 lb * 3 * A 2 


297 


40 127.5s 


42 0.30s 


40 3.84s 


mt3000 


500 x 3000 


2 y9 * 3' 


322 


177 40s 


178 2m37s 


177 5h23m 


mt5000a 


500 x 5000 


2 ibv * 3' 


1180 


298 5hl0m 


298 35m49s 


298 3h52m 


mt5000b 


500 x 5000 


2^ y * 3 ;! 


360 


312 3m41s 


312 57m6s 


312 2h40m 


mt 10000 


500 x 10000 


2 Jb ' *3 b 


6006 


N. A. N. A. 


637 lh34m 


637 lh39m 



Results 

We implemented our generalized Buneman pruning and the ILP in C++. The 
ILP was solved using the Concert callable library of CPLEX 10.0. We compared 
the performance of our method with two popular heuristic methods for maximum 
parsimony phylogeny inference — pars, which is part of the freely-available 
PHYLIP package [13], and PAUP* [12], the leading commercial phylogenetics 
package. We attempted to use PHYLIP's exact branch-and-bound method DNA 
penny for nucleotide sequences, but discontinued the tests when it failed to solve 
any of the data sets in under 24 hours. In each case, pars and PAUP* were run 
with default parameters. We first report results from three moderate-sized data 
sets selected to provide varying degrees of difficulty: a set of 1,043 sites from a 
set of 41 sequences of O. rufipogon (red rice) [10], 245 positions from a set of 80 
human mt-DNA sequences reported by [11], and 176 positions from 50 HIV-1 
reverse transcriptase amino acid sequences. The HIV sequences were retrieved 
by NCBI BLASTP [17] searching for the top 50 best aligned taxa for the query 
sequence CI 19571541 and default parameters. We then added additional tests 
on larger data sets all derived from human mitochondrial DNA. The mtDNA 
data was retrieved from NCBI BLASTN, searching for the 500 best aligned taxa 
for the query sequence GI 61287976 and default parameters. The complete set 
of 16,546 characters (after removing indels) was then broken in four windows 
of varying sizes and characteristics: the first 3,000 characters (mt3000), the first 
5,000 characters (mt5000a), the next 5,000 characters (mt5000b), and the first 
10,000 characters (mtlOOOO). Table 1 summarizes the results. 

For the set of 41 sequences of lhs-1 gene from O. rufipogon (red rice) [10], our 
method pruned the full graph of 2 18 * 3 2 nodes (after screening out redundant 
characters) to 58. Fig 4(a) shows the resulting phylogeny. Both PAUP* and pars 
yielded an optimal tree although more slowly than the ILP (2.09 seconds and 
2.57 seconds respectively, as opposed to 0.29 seconds). 

For the 245-base human mt-DNA sequences, the generalized Buneman prun- 
ing was again highly efficient, reducing the state set from 2 28 after removing 
redundant sequences to 64. Fig 4(b) shows the phylogeny returned. While PAUP* 
was able to find the optimal phylogeny (although it was again slower at 5.69 sec- 




Fig. 4. Most parsimonious phylogenies (a) lhs-1 gene for O. rufipogon [10] (b) Human 
mt-DNA [11] and (c) HIV-1 RT proteins [17]. Edges are labelled by their lengths in 
parentheses followed by sites that mutate along that edge. Dark red ovals are input 
taxa and light blue Steiner nodes. 



onds versus 0.48 seconds), pars yielded a slightly sub-optimal phytogeny (length 
45 instead of 44) in a comparable run time (0.56 seconds). 

For HIV-1 sequences, our method pruned the full graph of 2 16 * 3 * 4 2 possible 
nodes to a generalized Buneman graph of 297 nodes, allowing solution of the ILP 
in about two minutes. Fig 4(c) shows an optimal phytogeny for the data. PAUP* 
was again able to find the optimal phytogeny and in this case was faster than 
the ILP (3.84 seconds as opposed to 127.5 seconds), pars required a shorter run 
time of 0.30 seconds, but yielded a sub-optimal tree of length of 42, as opposed 
to the true minimum of 40. 

For the four larger mitochondrial datasets, Buneman pruning was again 
highly effective in reducing graph size relative to the complete graph, although 
the ILP approach eventually proves impractical when Buneman graph sizes 
grows sufficiently large. Two of the data sets yielded Buneman graphs of size 
below 400, resulting in ILP solutions orders of magnitude faster than the heuris- 
tics. mt5000a, however, yielded a Buneman graph of over 1,000 nodes, resulting 
in an ILP that ran more slowly than the heuristics. mtlOOOO resulted in a Bune- 



man graph of over 6,000 nodes, leading to an ILP too large to solve, pars was 
faster than PAUP* in all cases, but PAUP* found optimal solutions for all three 
instances we can verify while pars found a sub-optimal solution in one instance. 

We can thus conclude that the generalized Buneman pruning approach de- 
veloped here is very effective at reducing problem size, but solving provably to 
optimality does eventually become impractical for large data sets. Heuristic ap- 
proaches remain a practical necessity for such cases even though they cannot 
guarantee, and do not always deliver, optimality. Comparison of PAUP* to pars 
and the ILP suggests that more aggressive sampling over possible solutions by 
the heuristics can lead optimality even on very difficult instances but at the cost 
of generally greatly increased run time on the easy to moderate instances. 

Discussion 

We have presented a new method for finding provably optimal maximum parsi- 
mony phylogenies on multi-state characters with weighted state transitions, us- 
ing integer linear programming. The method builds on a novel generalization of 
the Buneman graph for characters with arbitrarily large but finite state sets and 
for arbitrary weight functions on character transitions. Although the method has 
an exponential worst-case performance, empirical results show that it is fast in 
practice and is a feasible alternative for data sets as large as a few hundred taxa. 
While there are many efficient heuristics for recontructing maximum parsimony 
phylogenies, our results cater to the need for provably exact methods that are 
fast enough to solve the problem for biologically relevant multi-state data sets. 
Our work could potentially be extended to include more sophisticated integer 
programming techniques that have been successful in solving large instances of 
other hard optimization problems, for instance the recent solution of the 85,900- 
city traveling salesman problem pla85900 [18]. The theoretical contributions of 
this paper may also prove useful to work on open problems in multi-state MP 
phylogenetics, to accelerating methods for related objectives, and to sampling 
among optimal or near-optimal solutions. 
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